Regression in R

Categories: code, analysis
Author

Levent Bulut

Published

April 9, 2023

Multiple Linear Regression

Multiple linear regression is a statistical method used to model the relationship between multiple independent variables and a single dependent variable. By using this method, we can make predictions about the value of the dependent variable based on the values of the independent variables.

Data

We have state-level census data on various socio-economic and demographic characteristics. The data consists of the following variables:

statedata<-read.csv("StateData.csv", head=T)
names(statedata)
 [1] "State"                       "OwnComputer"                
 [3] "CommutePublicTransport"      "TotalPopulation"            
 [5] "MedianAge"                   "WithCashAssistanceIncome"   
 [7] "MeanSocialSecurityIcnome"    "SupplementarySecurityIncome"
 [9] "WhiteOnly"                   "Latinos"                    
[11] "Asians"                      "AfricanAmerican"            
[13] "Income100K.150K"             "Income75K.100K"             
[15] "Income50K.75K"               "Income35K.50K"              
[17] "Income25K.35K"               "PovertyRate"                
library(dplyr)
Warning: package 'dplyr' was built under R version 4.2.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.2.3
library(knitr)
Warning: package 'knitr' was built under R version 4.2.3
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.2.2

Some variables are presented in percentage points as a fraction of the total population. Below is a snapshot of our data.

knitr::kable(head(statedata))
| State | OwnComputer | CommutePublicTransport | TotalPopulation | MedianAge | WithCashAssistanceIncome | MeanSocialSecurityIcnome | SupplementarySecurityIncome | WhiteOnly | Latinos | Asians | AfricanAmerican | Income100K.150K | Income75K.100K | Income50K.75K | Income35K.50K | Income25K.35K | PovertyRate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minnesota | 90.3 | 198984.888 | 5527358 | 37.9 | 3.4 | 20192 | 4.2 | 80.3 | 0.1 | 1.2 | 7.4 | 17.2 | 14.0 | 18.0 | 12.1 | 8.2 | 10.1 |
| Mississippi | 81.5 | 8966.286 | 2988762 | 37.2 | 2.2 | 17667 | 7.9 | 56.8 | 0.1 | 1.2 | 38.4 | 10.4 | 11.2 | 16.9 | 14.1 | 11.4 | 20.8 |
| Missouri | 87.3 | 85260.868 | 6090062 | 38.5 | 1.9 | 19054 | 5.5 | 79.6 | 0.1 | 2.5 | 12.8 | 13.0 | 12.5 | 18.6 | 14.4 | 10.6 | 14.2 |
| Montana | 87.3 | 8333.856 | 1041732 | 39.8 | 2.2 | 18696 | 4.4 | 86.3 | 0.1 | 1.3 | 1.0 | 12.6 | 12.2 | 19.1 | 14.4 | 11.1 | 13.7 |
| Nebraska | 88.5 | 13333.320 | 1904760 | 36.4 | 1.9 | 19673 | 3.9 | 79.4 | 0.3 | 2.9 | 5.9 | 15.1 | 13.7 | 19.3 | 13.8 | 9.8 | 11.6 |
| Nevada | 91.2 | 99376.866 | 2922849 | 37.9 | 3.0 | 19141 | 4.1 | 49.9 | 1.0 | 10.1 | 10.6 | 14.1 | 13.4 | 19.1 | 14.1 | 10.1 | 13.7 |
statedata %>%
  summarise(
    `Median OwnComputer` = median(OwnComputer),
    `IQR OwnComputer` = IQR(OwnComputer),
    `Median PovertyRate` = median(PovertyRate),
    `IQR PovertyRate` = IQR(PovertyRate),
    `Correlation, r` = cor(OwnComputer, PovertyRate)
    ) %>%
  kable(digits = c(0, 0, 0, 0, 2))
Summary statistics for OwnComputer and PovertyRate variables in statedata
| Median OwnComputer | IQR OwnComputer | Median PovertyRate | IQR PovertyRate | Correlation, r |
|---|---|---|---|---|
| 88 | 3 | 14 | 5 | -0.33 |

Our target variable is OwnComputer, the percentage of people who own a computer. It may not be the most exciting question; still, in this exercise, we will try to find the factors that determine our target variable.

In this blog post, we’ll explore how to conduct simple and multiple linear regression in R using the lm() function and interpret the regression output.

First, let’s load the statedata dataset into R and create a simple linear regression model with “OwnComputer” as the dependent variable and “PovertyRate” as the predictor.

model1: \[OwnComputer_{i}=\beta_{0}+\beta_{1}{PovertyRate}_{i}+\epsilon_{i}, ~~~~i=1,2,…,52\]

# Create a simple linear regression model with "OwnComputer" as the dependent variable and "PovertyRate" as the predictor
model1 <- lm(OwnComputer ~ PovertyRate, data = statedata)

We can summarize the results of the simple linear regression model using the summary() function:

# Summarize the results of model1
summary(model1)

Call:
lm(formula = OwnComputer ~ PovertyRate, data = statedata)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0341 -1.3216 -0.2835  1.6597  7.8561 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.09419    1.12062  81.289   <2e-16 ***
PovertyRate -0.18315    0.07402  -2.474   0.0168 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.731 on 50 degrees of freedom
Multiple R-squared:  0.1091,    Adjusted R-squared:  0.09127 
F-statistic: 6.122 on 1 and 50 DF,  p-value: 0.01679

This will output a summary of the regression results, including the coefficients, standard errors, t-values, p-values, and R-squared value. We can use this information to interpret the results of the regression analysis.

The estimates of the intercept \((\hat\beta_{0})\) and the slope \((\hat\beta_{1})\) of model1 can be extracted with model1$coefficients.

The intercept estimate in model1 is 91.0941893, which means that the predicted value of “OwnComputer” when “PovertyRate” is zero is 91.0941893.

The slope estimate of -0.1831527 tells us that for every one-unit increase in “PovertyRate”, the predicted value of “OwnComputer” changes by -0.1831527 units.

Std. Error: a measure of the precision of the coefficient estimate. The lower the standard error, the more precise the estimate.

  • t-value: the ratio of the coefficient to its standard error
  • \(Pr(>|t|)\): the probability of obtaining an estimate \(\hat\beta\) that large (in absolute value) if the null were true (Null: \(\beta=0\))
  • p-value < 0.05: the coefficient is significantly different from zero at the \(5\%\) level.
  • p-value < 0.01: the coefficient is statistically significant at the \(1\%\) level.

Multiple R-squared: This value measures the proportion of variation in the dependent variable OwnComputer that can be explained by the independent variable(s); here, it is PovertyRate. We can extract the Multiple R-squared value from the summary with the summary(model1)$r.squared line of code.

In model1, the Multiple R-squared value is 0.1090863, which means that approximately 10.91% of the variation in “OwnComputer” can be explained by “PovertyRate”.

Residual standard error: This value measures the average amount by which the response variable OwnComputer deviates from the values predicted by the model. We can extract the Residual standard error from the summary with the summary(model1)$sigma line of code.

In model1, the Residual standard error is 2.7313536, which means that the predicted value of “OwnComputer” using “PovertyRate” as a predictor is typically off by about 2.7313536 units.

Adjusted R-squared: This value is a modified version of the Multiple R-squared value that adjusts for the number of independent variables in the model. We can extract the Adjusted R-squared value from the summary with the summary(model1)$adj.r.squared line of code. The Adjusted R-squared value will increase only if a new variable increases the explanatory power of the model. The formula for the Adjusted R-squared is \(1-\frac{(1-R^2)(n-1)}{n-k}\), where n is the number of observations and k is the number of estimated coefficients, including the constant.

In model1, the Adjusted R-squared value is 0.091268, which is slightly lower than the Multiple R-squared value. This indicates that the model is not overfitting the data.
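The Adjusted R-squared formula can be checked numerically against the value summary() reports. Since StateData.csv may not be at hand, the sketch below simulates a stand-in dataset with the same column names, so the exact numbers will differ from the ones quoted in this post.

```r
# Check the Adjusted R-squared formula against summary()'s value.
# Simulated stand-in for statedata (StateData.csv may not be available).
set.seed(1)
sim <- data.frame(PovertyRate = runif(52, 8, 25))
sim$OwnComputer <- 91 - 0.18 * sim$PovertyRate + rnorm(52, sd = 2.7)

fit <- lm(OwnComputer ~ PovertyRate, data = sim)
r2  <- summary(fit)$r.squared
n   <- nrow(sim)   # number of observations
k   <- 2           # estimated coefficients, including the constant
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k)

all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
```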

F-statistic: This value measures the overall significance of the model. In other words, the F-statistic tests the null that all coefficients (excluding the constant) are equal to zero. The p-value for the F-statistic captures the probability that an F-statistic this large could occur by chance while the null is true (no relationship). If the p-value is below the chosen significance level, the model is significant: at least one predictor (excluding the constant) is significant.

We can extract the F-statistic value from the summary with the summary(model1)$fstatistic[1] line of code.

In model1, the F-statistic is 6.1221559 with a p-value of 0.01679, which means that the model as a whole is statistically significant.

p-value: This value measures the statistical significance of the independent variable(s).

To extract the point estimate for the intercept, we can use the summary(model1)$coefficients[1] line of code.

To extract the standard error for the point estimate for the intercept, we can use the summary(model1)$coefficients[1,2] line of code.

To extract the test statistics for the intercept, we can use the summary(model1)$coefficients[1,3] line of code.

To extract the p value for the test statistics for the intercept, we can use the summary(model1)$coefficients[1,4] line of code.

Likewise, the point estimate, the standard error, the t-value, and the p-value for the slope coefficient can be extracted with summary(model1)$coefficients[2], summary(model1)$coefficients[2,2], summary(model1)$coefficients[2,3], and summary(model1)$coefficients[2,4] lines of code, respectively.
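The extraction calls above can be collected in one place. A minimal sketch on a simulated stand-in for statedata (StateData.csv may not be at hand), so the numbers will differ from the post:

```r
# Extract the slope's estimate, standard error, t-value, and p-value
# from the coefficient matrix. Simulated stand-in for statedata.
set.seed(1)
sim <- data.frame(PovertyRate = runif(52, 8, 25))
sim$OwnComputer <- 91 - 0.18 * sim$PovertyRate + rnorm(52, sd = 2.7)
model1 <- lm(OwnComputer ~ PovertyRate, data = sim)

coefs     <- summary(model1)$coefficients
slope_est <- coefs[2, 1]  # point estimate
slope_se  <- coefs[2, 2]  # standard error
slope_t   <- coefs[2, 3]  # t-value
slope_p   <- coefs[2, 4]  # p-value

all.equal(slope_t, slope_est / slope_se)  # TRUE: t is estimate / std. error
```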

Just note that we can always use the update() function in R to make our model larger with more covariates. If we want to create a larger model that includes “MedianAge”, “Latinos”, and “Income100K.150K” as predictors, we can use the following line of code to update model1 :

model2 <- update(model1, . ~ . + MedianAge + Latinos + Income100K.150K)

If, on the other hand, we want to create a larger model with all predictors except State and MedianAge, we can use the following line of code:

model3 <- lm(OwnComputer ~ . - State - MedianAge, data = statedata)

Hypothesis Testing after Regression

In a regression model, we often want to test the hypothesis that the slope coefficient for a particular independent variable is equal to some null value. This hypothesis can be tested using a t-test, which compares the estimated slope coefficient to the null value.

\(\hat\beta_{i} \sim N(\beta_{i},~se(\hat\beta_{i})^{2})\)

\(\frac{\hat\beta_{i}-\beta_{i}}{se(\hat\beta_{i})}\sim t(n-2)\)

  • Let’s say that we want to test the null \(H_{0}:~~\beta_{1}>=-0.15\) with the alternative hypothesis that \(H_{1}:~~\beta_{1}<-0.15\), a one-sided test.

  • recall: \(\hat\beta_{1}=\) -0.1831527 and \(se(\beta_{1})=\) 0.0740221

  • Test statistic = \(\frac{\hat\beta_{i}-\beta_{i}}{se(\hat\beta_{i})}\), evaluated at the null value

We can create a new R object to store the model1 summary with the summary_model1 <- summary(model1) line of code. We can extract the slope coefficient from the summary with the slope_coef <- summary_model1$coefficients[2,1] line of code, and the standard error of the slope coefficient with the std_error <- summary_model1$coefficients[2,2] line of code. Finally, we can create an R object to store the null value with the null_value <- -0.15 line of code.

The resulting test statistics can be calculated with the t_value <- (slope_coef-null_value)/std_error line of code.

In calculating the p value, we also need to calculate the degrees of freedom as we use t-distribution to test the null. We can extract the degrees of freedom from model summary with the df <- summary_model1$df[2] line of code. Now, we can calculate the p-value for the one-sided test with the following line of code: p_value <- pt(t_value, df) . Here, the pt(q,df) function calculates the cumulative distribution function of the t-distribution. It takes two arguments: the quantile value q, which represents the t-value in the t-distribution, and the degrees of freedom df, which determine the shape of the t-distribution. The resulting value gives the probability of observing a t-value less than or equal to q in the t-distribution with df degrees of freedom.

Our test statistic is -0.4478761 with a p-value of 0.3280886. At conventional significance levels, we fail to reject the null of \(H_{0}:~~\beta_{1}>=-0.15\).
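The steps above can be collected into one short script. As before, a simulated stand-in for statedata is used, so the test statistic and p-value will differ from the figures quoted above:

```r
# One-sided t-test of H0: beta1 >= -0.15 vs H1: beta1 < -0.15.
# Simulated stand-in for statedata.
set.seed(1)
sim <- data.frame(PovertyRate = runif(52, 8, 25))
sim$OwnComputer <- 91 - 0.18 * sim$PovertyRate + rnorm(52, sd = 2.7)
model1 <- lm(OwnComputer ~ PovertyRate, data = sim)

summary_model1 <- summary(model1)
slope_coef <- summary_model1$coefficients[2, 1]
std_error  <- summary_model1$coefficients[2, 2]
null_value <- -0.15
df         <- summary_model1$df[2]   # residual degrees of freedom, n - 2

t_value <- (slope_coef - null_value) / std_error
p_value <- pt(t_value, df)           # left-tail probability for the one-sided test
```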

Checking for Outliers or Influential Points

In regression analysis, it is important to check for outliers and influential points that may have a large effect on the model results. We define leverage points as points that are far from the mean of the X’s. Likewise, influential points are the leverage points that can influence the model fit. One can use standardized residuals to classify observations as outliers or not. The formula for the standardized residuals is:

\[r_{i}^*=\frac{y_{i}-\hat y_{i}}{\sqrt{MSE}}\]

Then, one can use different threshold levels to classify observations:

  • \(|r_{i}^*| > 1\): large

  • \(|r_{i}^*| > 2\): extremely large
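In R, the rstandard() function computes standardized residuals directly; note that it also divides by \(\sqrt{1-h_{ii}}\) (a leverage adjustment), so it is a slight refinement of the formula above. A sketch on simulated stand-in data for statedata:

```r
# Classify observations by the size of their standardized residuals.
# Simulated stand-in for statedata.
set.seed(1)
sim <- data.frame(PovertyRate = runif(52, 8, 25))
sim$OwnComputer <- 91 - 0.18 * sim$PovertyRate + rnorm(52, sd = 2.7)
model1 <- lm(OwnComputer ~ PovertyRate, data = sim)

std_res    <- rstandard(model1)        # standardized (internally studentized) residuals
large      <- which(abs(std_res) > 1)  # "large"
very_large <- which(abs(std_res) > 2)  # "extremely large"
```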

Cook’s distance is another method to measure the impact of each observation on the fitted values of the model.

We can use the cooks.distance() function to calculate Cook’s distance for each observation in the dataset. Since the measure is built from residuals and fitted values, we need a fitted model first. To calculate the Cook’s distance measures from model1, we can use the model1_cookd <- cooks.distance(model1) line of code to store the distance measures in the model1_cookd R object. Let’s plot the distance measures below.

model1_cookd<-cooks.distance(model1)
plot(model1_cookd,type="h", ylab="Cook's Distance" )

We can detect the row numbers of observations with a Cook’s distance greater than a certain threshold (e.g., 4/n, 1) with the which(model1_cookd > threshold) line of code.

outlier_points <- which(model1_cookd > 1)

If we want to remove the outlier from statedata and store the new dataset under a new name, we can use the following line of code: statedata1 <- statedata[-outlier_points,], which removes the outlier rows from statedata and saves the reduced dataset as statedata1.
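Putting the Cook’s distance steps together with the common 4/n threshold (again on a simulated stand-in for statedata; note the guard for the case where nothing is flagged, since subsetting with an empty negative index would drop all rows):

```r
# Flag high-influence rows with Cook's distance and refit without them.
# Simulated stand-in for statedata.
set.seed(1)
sim <- data.frame(PovertyRate = runif(52, 8, 25))
sim$OwnComputer <- 91 - 0.18 * sim$PovertyRate + rnorm(52, sd = 2.7)
model1 <- lm(OwnComputer ~ PovertyRate, data = sim)

cookd   <- cooks.distance(model1)
flagged <- which(cookd > 4 / nrow(sim))   # common rule-of-thumb threshold
sim_reduced <- if (length(flagged) > 0) sim[-flagged, ] else sim
model1_refit <- lm(OwnComputer ~ PovertyRate, data = sim_reduced)
```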

Checking for Multi-collinearity

Multicollinearity can be a problem in regression analysis when predictor variables are highly correlated with each other. We can use the variance inflation factor (VIF) to check for this and remove any highly correlated variables from our model.

VIF is a measure of how much the variance of an estimated regression coefficient is inflated due to multicollinearity. In practice, any VIF value greater than 10 is a sign of multicollinearity, and the corresponding variable should be removed from the model. Let’s define modelfull, which uses all the columns of statedata except the State variable, and calculate the VIF values.

modelfull <- lm(OwnComputer ~ .-State, data = statedata)

# Calculate the VIF for each predictor variable in the model
vif_values <- vif(modelfull)

# Identify variables with a VIF greater than 10
high_vif <-names(vif_values[vif_values > 10])

# Remove the variables with high VIF values from the model and create a new model called modelfull_new
modelfull_new <- update(modelfull, ~.-Income100K.150K-Income75K.100K-Income50K.75K-Income35K.50K-Income25K.35K-PovertyRate)

The high_vif object stores the names of the variables with VIF values greater than 10. The modelfull_new excludes the variables with high VIF values.
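For intuition, the VIF of a predictor can be computed by hand: regress that predictor on the other predictors and apply \(1/(1-R_{j}^{2})\). A minimal sketch with two deliberately correlated simulated predictors (x1 and x2 are hypothetical names, not columns of statedata):

```r
# VIF by hand: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors. x1 and x2 are simulated.
set.seed(1)
x1 <- rnorm(52)
x2 <- 0.9 * x1 + rnorm(52, sd = 0.3)   # strongly correlated with x1
y  <- 1 + x1 - x2 + rnorm(52)

r2_aux <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_aux)   # same quantity car::vif() reports for x1 in lm(y ~ x1 + x2)
```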

Confidence and Prediction Interval Estimations

Finally, we will explore prediction interval estimations to determine the range of values within which we can expect our predicted values to fall.

# Create a new dataset for prediction
newstates <- data.frame(PovertyRate = c(8, 28, 35))

# Use the predict() function to make predictions
predictions <- predict(model1, newdata=newstates)

# Print the predictions
print(predictions)
       1        2        3 
89.62897 85.96591 84.68384 

We can also use the predict() function to estimate prediction intervals for our predictions. For example, let’s say we want to estimate a 95% prediction interval for our predictions:

# Use the predict() function to estimate prediction  intervals 
pred_int <- predict(model1, newdata=newstates, interval = "prediction", level = 0.95)

# Print the prediction  intervals
print(pred_int)
       fit      lwr      upr
1 89.62897 84.01302 95.24492
2 85.96591 80.06199 91.86984
3 84.68384 78.34387 91.02382

This will output a matrix of lower and upper bounds for the prediction intervals of our predictions. The interval = "prediction" option estimates a prediction interval for OwnComputer by accounting for the noise and variation in the new data set.

The interval = "confidence" option on the other hand estimates a confidence interval for OwnComputer under the assumption that the new observations follow the same patterns as the original dataset, assuming that the regression model is correct and the errors are normally distributed.

In general, prediction intervals are wider than confidence intervals because they include the uncertainty due to the variation of the errors around the expected value of the response variable.

As can be seen below, the confidence interval ranges are narrower:

# Use the predict() function to estimate confidence  intervals
confint <- predict(model1, newdata=newstates, interval = "confidence", level = 0.95)

# Print the confidence  intervals
print(confint)
       fit      lwr      upr
1 89.62897 88.42824 90.82969
2 85.96591 83.78436 88.14747
3 84.68384 81.50608 87.86160
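The width claim can be verified directly: at every new observation, the prediction interval is wider than the confidence interval. A sketch on a simulated stand-in for statedata (so the interval endpoints will differ from those printed above):

```r
# Prediction intervals are wider than confidence intervals at every point.
# Simulated stand-in for statedata.
set.seed(1)
sim <- data.frame(PovertyRate = runif(52, 8, 25))
sim$OwnComputer <- 91 - 0.18 * sim$PovertyRate + rnorm(52, sd = 2.7)
model1 <- lm(OwnComputer ~ PovertyRate, data = sim)
newstates <- data.frame(PovertyRate = c(8, 28, 35))

pi <- predict(model1, newdata = newstates, interval = "prediction", level = 0.95)
ci <- predict(model1, newdata = newstates, interval = "confidence", level = 0.95)
all((pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"]))  # TRUE
```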